
Conversation


@robertandremitchell robertandremitchell commented Dec 8, 2025

Description

Creates a function that expects a small payload of data on the returned/expected LOINC to evaluate the degree of accuracy of our algorithm. Definitions of how we are currently considering accuracy are outlined here: https://docs.google.com/document/d/1yA5NJ06mf1EfLZRmNrrNKopWL6ExMj-dPYKy8wlVDGs/edit?tab=t.0#heading=h.b1r0q3mit8hy
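To make the accuracy check concrete, here is a minimal sketch of the per-example evaluation, assuming the field names used in the JSONL records shown further down (expected_label, results, rank, label, score); the function name and exact matching rule are placeholders, not the final implementation.

```python
# Minimal sketch: given the expected LOINC label and the ranked candidates
# returned by the model, report whether the expected code shows up at rank 1
# or anywhere in the top k. Field names mirror the JSONL records below.
from typing import Optional, TypedDict


class ExpectedMatch(TypedDict):
    rank: Optional[int]
    score: Optional[float]
    is_correct_in_topk: bool
    is_correct_top1: bool


def evaluate_expected_match(expected_label: str, results: list[dict]) -> ExpectedMatch:
    """Locate the expected label among the ranked results and summarize accuracy."""
    for candidate in results:
        if candidate["label"].strip().lower() == expected_label.strip().lower():
            return {
                "rank": candidate["rank"],
                "score": candidate["score"],
                "is_correct_in_topk": True,
                "is_correct_top1": candidate["rank"] == 1,
            }
    # The expected label never surfaced among the top-k candidates.
    return {"rank": None, "score": None, "is_correct_in_topk": False, "is_correct_top1": False}
```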

This updates additional parts of the code to add the expected LOINC to the example data. However, there are open questions:

  • do we expect the returned codes to always be of one particular type (long name, common name, etc.)?
  • if so, is the intention to make a call to the LOINC API or to our internal CSV to try to match it to a code? (A minimal CSV-lookup sketch follows this list.)
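If we go the internal-CSV route, the lookup could be as simple as the sketch below; the column names (LOINC_NUM, LONG_COMMON_NAME, SHORTNAME) and the file path are assumptions about the CSV layout rather than the actual files in the repo.

```python
# Sketch of a case-insensitive name -> LOINC code lookup built from an internal
# CSV, so a returned label can be mapped to a code without calling the LOINC API.
# Column names here are assumptions, not the confirmed layout.
import csv


def build_name_to_loinc_lookup(csv_path: str) -> dict[str, str]:
    lookup: dict[str, str] = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for name_col in ("LONG_COMMON_NAME", "SHORTNAME"):
                name = (row.get(name_col) or "").strip().lower()
                if name:
                    lookup[name] = row["LOINC_NUM"]
    return lookup


# Hypothetical usage:
# lookup = build_name_to_loinc_lookup("data/loinc_subset.csv")
# lookup.get("hester davis fall risk scale")  # -> a LOINC code, or None
```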

I have a dummy notebook with small edits to the performance.ipynb (here: https://ml.azure.com/fileexplorerAzNB?wsid=/subscriptions/6848426c-8ca8-4832-b493-fed851be1f95/resourcegroups/dibbs-ttc-training/providers/Microsoft.MachineLearningServices/workspaces/dibbsttc&tid=28cf58df-efe8-4135-b2d1-f697ee74c00c&activeFilePath=Users/robert.a.mitchell/performance-copy.ipynb&notebookPivot=0) that does the following:

  • dynamic updates to the code based on model name
  • switches from un-pickling to loading in JSONL
  • outputs a JSONL that looks like the following (a small accuracy roll-up sketch follows the example records):
{"example_idx": 0, "query_input": "Hester Davis fall risk scale", "expected_label": "Hester Davis fall risk scale", "k": 10, "encoding_time_s": 0.32662534713745117, "search_time_s": 0.0002624988555908203, "expected_match": {"rank": null, "score": null, "is_correct_in_topk": false, "is_correct_top1": false}, "results": [{"rank": 1, "corpus_id": 934, "label": "Coronavirus anxiety scale", "loinc_type": "Order", "score": 0.8334473371505737}, {"rank": 2, "corpus_id": 70, "label": "Abbreviated Injury Scale panel AAAM", "loinc_type": "Order", "score": 0.8256902694702148}, {"rank": 3, "corpus_id": 886, "label": "Goal attainment scale - Reported", "loinc_type": "Order", "score": 0.8089544177055359}, {"rank": 4, "corpus_id": 22, "label": "17-Hydroxyprogesterone [Measurement] in DBS", "loinc_type": "Order", "score": 0.8040225505828857}, {"rank": 5, "corpus_id": 712, "label": "Bacterial susceptibility panel by Disk diffusion (KB)", "loinc_type": "Order", "score": 0.8019613027572632}, {"rank": 6, "corpus_id": 117, "label": "Active range of motion panel Quantitative", "loinc_type": "Order", "score": 0.7989861369132996}, {"rank": 7, "corpus_id": 166, "label": "ADL functional rehabilitation potential Set", "loinc_type": "Order", "score": 0.7988804578781128}, {"rank": 8, "corpus_id": 676, "label": "Cholinesterase activity panel - Serum or Plasma", "loinc_type": "Order", "score": 0.7974450588226318}, {"rank": 9, "corpus_id": 906, "label": "Centers for Environmental Health trace metals screen panel [Mass/volume] - Urine", "loinc_type": "Order", "score": 0.7957779169082642}, {"rank": 10, "corpus_id": 499, "label": "Anemia evaluation panel - Serum or Blood", "loinc_type": "Order", "score": 0.7948352098464966}]}
{"example_idx": 1, "query_input": "F9 gene familial mut Doc analysis molecular genetics (Bld/Tiss)", "expected_label": "F9 gene familial mut analysis Molgen Doc (Bld/Tiss)", "k": 1, "encoding_time_s": 0.5327327251434326, "search_time_s": 0.0006182193756103516, "expected_match": {"rank": null, "score": null, "is_correct_in_topk": false, "is_correct_top1": false}, "results": [{"rank": 1, "corpus_id": 885, "label": "Glycosylation congenital disorders multigene analysis in Blood or Tissue by Molecular genetics method", "loinc_type": "Order", "score": 0.8319992423057556}]}

I've only run it on a small fraction (1 of 283 files). Ideally, the examples file we load in would also include the LOINC ID and LOINC type, so that we could do a more comprehensive check of whether:

  • the LOINC ID and/or name matches
  • if it does not match, whether there is a match among the high-probability candidates that is closer on type (i.e., if the expected code is an Order but there's an 85% match that's an Observation and an 83% match that's an Order, we would in theory want to use the 83% match; a small sketch of this re-ranking idea follows this list)
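One way to express that type-aware fallback is sketched below; this is just the idea, not existing code, and the field names follow the JSONL records above. Whether we actually want this behavior is part of the open question.

```python
# Sketch of the type-aware fallback: if the top candidate's loinc_type does not
# match the expected type, fall back to the best-scoring candidate that does.
def pick_candidate(results: list[dict], expected_type: str) -> dict:
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    best_overall = ranked[0]
    if best_overall["loinc_type"] == expected_type:
        return best_overall
    same_type = [r for r in ranked if r["loinc_type"] == expected_type]
    # e.g. expected "Order": skip an 0.85 "Observation" in favor of an 0.83 "Order".
    return same_type[0] if same_type else best_overall
```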

For the sake of an end-to-end product, I've taken a small snippet of the data and run the text fields against the LOINC API to get a LOINC ID. I've also rewritten the scripts for generating key pairs for examples to include the LOINC codes, so that in future runs we can skip the call to the LOINC API.
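For reference, the LOINC ID lookup could look roughly like the sketch below. The endpoint, query parameters, and response field names are assumptions about the LOINC search API (it requires a LOINC.org account) and should be verified against its current documentation; this is not the exact call used in the scripts.

```python
# Rough sketch of resolving a free-text field to a LOINC ID via the LOINC
# search API. Endpoint, parameters, and response shape are assumptions.
from typing import Optional

import requests


def lookup_loinc_code(text: str, username: str, password: str) -> Optional[str]:
    resp = requests.get(
        "https://loinc.regenstrief.org/searchapi/loincs",  # assumed endpoint
        params={"query": text, "rows": 1},                 # assumed parameters
        auth=(username, password),
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json().get("Results", [])
    # Take the top hit's LOINC number, if any (field name is an assumption).
    return results[0].get("LOINC_NUM") if results else None
```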

Related Issues

Closes #173

Additional Notes

The logic of the third-degree match is still a tad shaky. The sample data shows two different kinds: one where the LOINCs and OIDs differ but connect to the same condition, and another where the LOINCs and OIDs differ but connect to several conditions that are all the same.

Related to this code itself, I think the only other function we may want to add down the line is one to transform the data we need from the matching protocol into the right shape, but that should be relatively straightforward since this script really only needs two columns' worth of data.
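Since only two columns are needed, that shape-transition helper could be as small as the sketch below (column names are hypothetical placeholders for whatever the matching protocol actually produces):

```python
# Sketch: pull just the two columns this script needs (free-text input and
# expected LOINC code) out of the matching-protocol output. Column names are
# placeholders, not the real schema.
import pandas as pd


def to_eval_shape(matching_df: pd.DataFrame) -> pd.DataFrame:
    return (
        matching_df[["query_input", "expected_loinc_code"]]
        .dropna()
        .drop_duplicates()
        .reset_index(drop=True)
    )
```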

Checklist

Please review and complete the following checklist before submitting your pull request:

  • I have ensured that the pull request is of a manageable size, allowing it to be reviewed within a single session.
  • I have reviewed my changes to ensure they are clear, concise, and well-documented.
  • I have updated the documentation, if applicable.
  • I have added or updated test cases to cover my changes, if applicable.
  • I have minimized the number of reviewers to include only those essential for the review.

Checklist for Reviewers

Please review and complete the following checklist during the review process:

  • The code follows best practices and conventions.
  • The changes implement the desired functionality or fix the reported issue.
  • The tests cover the new changes and pass successfully.
  • Any potential edge cases or error scenarios have been considered.

@robertandremitchell robertandremitchell linked an issue Dec 8, 2025 that may be closed by this pull request

codecov-commenter commented Dec 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.62%. Comparing base (dd3153d) to head (bc55c64).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #175      +/-   ##
==========================================
+ Coverage   93.58%   93.62%   +0.04%     
==========================================
  Files          17       17              
  Lines         561      596      +35     
==========================================
+ Hits          525      558      +33     
- Misses         36       38       +2     


@m-goggins m-goggins added the Algorithm Development Tasks related to training, testing, evaluating and improving language models label Dec 10, 2025
@robertandremitchell robertandremitchell marked this pull request as draft January 8, 2026 19:19
@robertandremitchell robertandremitchell marked this pull request as ready for review January 8, 2026 21:00

@bamader bamader left a comment


Logic looks mostly good here, but I have some questions to make sure I understand what's going on.

  1. What are the changes being made to the azure_scripts/performance.ipynb notebook? From what I can tell, it looks like cell formatting and some cell output (which we don't want in the github repo anyway), but just wondering if there's been any logic changes. This is the notebook we use to compute our performance stats right now so I want to make sure it doesn't break. It doesn't look like there's any of the valueset accuracy stuff added to it, so if we want to document that as a second notebook that might be my preference?
  2. Our results on most of these tests look absolutely terrible, which is giving me a bit of concern. We were hitting 65% accuracy on our earlier tests, but here using the same test case generation methodology, I can see 8 total cases where we find the right answer out of the hundreds/thousands you're running. That seems like something deeper is wrong to me. How is eval_results_snipped_with_loinc_codes.txt being generated? Are you getting these results using the regular performance notebook, or the copy you have in Azure? I feel like something has to be misaligned somewhere because even if we're not getting the right valuesets when the model is wrong, the rate we're getting regular codes flat-out wrong suggests something off to me.

Comment on lines 30 to 32
file_path = (
    "/Users/rob/dibbs-text-to-code/data/accuracy_evaluation/eval_results_snippet_with_codes.jsonl"
)

Flagging that we'll need to change this before merging

@robertandremitchell (Collaborator, Author)

Replying to @bamader's two questions above:
  1. The primary changes are: 1) changing from pickle to JSONL (roughly as sketched below), and 2) in the final step, outputting the results of each run into JSONLs in the local folder. I don't think it introduces any change to how performance stats are calculated. I'm not opposed to breaking this out into another notebook to avoid multiple uses/updates being tied to one notebook.
  2. So far, I've only run this against the JSONLs generated back in December that you flagged an issue with. So I'm not sure if the accuracy would be impacted by that issue. I can re-review the notebook to see if there's anything wrong with how that loop through is going.
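For context on the pickle-to-JSONL swap, it is roughly the following (file names are placeholders, not the actual notebook paths):

```python
# Rough illustration of change (1): loading examples from JSONL instead of
# un-pickling them.
import json

# Previous approach:
# import pickle
# with open("examples.pkl", "rb") as f:
#     examples = pickle.load(f)

with open("examples.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]
```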


bamader commented Jan 12, 2026

@robertandremitchell Yeah let's split it out into a second notebook. JSONL is super slow for general performance so this'll save us the hassle of switching back and forth when we want to do fine-tuning runs. Also, before uploading the new notebook, can you hit the Clear All Outputs option so we don't commit stuff like the pip install log? Should be accessible from the three dots at the top left of the notebook editing window in Azure.

As far as the accuracy goes, I don't think the performance should really be affected that much by those nulls--I just don't understand why we can hit 65+% on our regular runs with the validation set but then for you it's showing up as minimal. Especially because last week I sanity checked the JSONL files by doing a performance run where I loaded from JSONL and then still achieved our expected performance numbers. Can you ping me when your split-out notebook is updated in Azure? I think looking at it there might help me see if something is up.



Development

Successfully merging this pull request may close these issues.

Create function(s) to assess valueset accuracy

6 participants